Group L04G03
The University of Sydney
In recent years, the housing market has become a central topic of interest as property prices have skyrocketed across the United States. New York is now among the ten most expensive cities in the world. This has raised important questions for homeowners, real estate investors, and city planners alike:
Note
What are the key factors that drive the value of a home?
Data resource: Data on houses in Saratoga County, New York, USA in 2006
Data structure: 1734 observations on 17 variables. The Test variable was ignored because its meaning was unknown.
- Price = price of the house
- Lot.Size = size of the house’s lot in acres
- Age = age of the house in years
- Land.Value = value of land (in $USD)
- Living.Area = living area in square feet
- Pct.College = percentage of neighborhood that graduated college
- Bedrooms = number of bedrooms
- Fireplaces = number of fireplaces
- Bathrooms = number of bathrooms
- Rooms = number of rooms
- Heating.Type = type of heating system
- Fuel.Type = type of fuel used for heating
- Sewer.Type = type of sewer system
- Waterfront = whether the property includes waterfront
- New.Construction = whether the property is a new construction
- Central.Air = whether the house has central air
In our project statement, Price is the dependent variable to be predicted and all other variables are considered independent variables.
Based on the Heat Map, Full Model and Stepwise Model, we chose these variables as independent variables: Lot.Size, Waterfront, Land.Value, New.Construct, Living.Area, Bathrooms
Heat Map
Full Model
Stepwise
To complete the linear regression analysis and determine whether to apply logarithmic transformations, we examined the selected variables and made the following decisions:
Log Land.Value and Living.Area: both have large ranges and are right-skewed, so taking logs improves model stability.
No log for Waterfront, Central.Air, New.Construct: binary variables are unsuitable for logging.
No log for Lot.Size and Bathrooms: both have small ranges or contain zeros, so they are not suitable for logging.
| Statistic | Lot_Size | Waterfront | Land_Value | Central_Air | New_Construct | Living_Area | Bathrooms |
|---|---|---|---|---|---|---|---|
| Min. | 0.0000 | 0.00000 | 200 | 0.0000 | 0.00000 | 616 | 0.000 |
| 1st Qu. | 0.1700 | 0.00000 | 15100 | 0.0000 | 0.00000 | 1300 | 1.500 |
| Median | 0.3700 | 0.00000 | 25000 | 0.0000 | 0.00000 | 1632 | 2.000 |
| Mean | 0.5003 | 0.00865 | 34536 | 0.3662 | 0.04671 | 1753 | 1.899 |
| 3rd Qu. | 0.5400 | 0.00000 | 40200 | 1.0000 | 0.00000 | 2134 | 2.500 |
| Max. | 12.2000 | 1.00000 | 412600 | 1.0000 | 1.00000 | 5228 | 4.500 |
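The transformation decisions above can be illustrated with a minimal sketch. It uses only the minimum, median, and maximum values from the summary table (not individual houses), and the helper name `log_transform` is a hypothetical one introduced here for illustration, not part of the original analysis.

```python
import math

def log_transform(values, name="variable"):
    """Return natural logs of values, refusing zeros or negatives."""
    if any(v <= 0 for v in values):
        raise ValueError(f"{name} contains non-positive values; log is undefined")
    return [math.log(v) for v in values]

# Min, median, and max taken from the summary table (illustrative only).
land_value = [200, 25000, 412600]
lot_size = [0.0, 0.37, 12.2]

# Land.Value spans a factor of roughly 2000 on the raw scale,
# but fewer than 8 units on the natural-log scale.
log_land_value = log_transform(land_value, "Land.Value")
log_range = log_land_value[-1] - log_land_value[0]

# Lot.Size has a minimum of 0, so the log is undefined for at least one house.
try:
    log_transform(lot_size, "Lot.Size")
    lot_size_loggable = True
except ValueError:
    lot_size_loggable = False

print(round(log_range, 2), lot_size_loggable)
```

The same check explains why Bathrooms, whose minimum is also 0, was left on its original scale.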
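To examine the relationship between price and the selected variables, we compared the following linear regression models:
For the final model, we chose the log-log model because it had the lowest RMSE and MAE and the second-highest R-squared. Note that the RMSE and MAE of the two log-price models are measured in log units, so they are directly comparable only with each other, not with the dollar-scale errors of the two linear-price models.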
| Model | RMSE | R-squared | MAE |
|---|---|---|---|
| Linear – Linear model | 58752.03 | 0.645126 | 42041.38 |
| Linear – Log model | 63737.29 | 0.5837296 | 45835.12 |
| Log – Linear model | 0.2986343 | 0.5741902 | 0.2117078 |
| Log – Log model | 0.2937391 | 0.5870239 | 0.2109565 |
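As a rough sketch of how the three metrics in the table are computed, and of how errors from a log-price model could be put back on the dollar scale, consider the following. The prices and log-scale predictions are hypothetical values invented for illustration; they are not taken from the Saratoga data or from the fitted models.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def mae(actual, predicted):
    """Mean absolute error."""
    n = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / n

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical sale prices and hypothetical predictions of log(price).
actual_price = [150000.0, 200000.0, 320000.0, 95000.0]
pred_log_price = [11.95, 12.20, 12.60, 11.50]

# Error on the log scale: the kind of figure a log-price model reports.
log_actual = [math.log(p) for p in actual_price]
rmse_log = rmse(log_actual, pred_log_price)

# Error on the dollar scale after exponentiating the predictions.
# Plain exp() is a simple back-transform; it tends to target the median
# rather than the mean price, so treat it as an approximation.
pred_price = [math.exp(v) for v in pred_log_price]
rmse_dollars = rmse(actual_price, pred_price)

print(rmse_log, rmse_dollars)
```

Computing the dollar-scale error this way is one option for comparing all four models on a common footing.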
The general form of the log-log regression equation is:
\[\begin{aligned} \log(\text{Price}) &= \beta_0 + \beta_1 \text{Lot.Size} + \beta_2 \text{Waterfront} \\ &\quad + \beta_3 \log(\text{Land.Value}) + \beta_4 \text{New.Construct} \\ &\quad + \beta_5 \log(\text{Living.Area}) + \beta_6 \text{Bathrooms} \end{aligned}\]
Using the given coefficients, the formula becomes:
Multicollinearity: The high correlation between living area and bedrooms (coefficient of 0.73) can inflate standard errors, which makes it difficult to assess the significance of individual predictors.
Violation of assumptions
Independence: The D-W statistic of 1.5472 suggests slight positive autocorrelation between residuals.
Normality: Outliers in the dataset affected the Q-Q plot, especially at the tails, indicating a deviation from a normal distribution.
Homoskedasticity: Residual plots indicate heteroskedasticity: the variance of the errors increased with certain values, violating the assumption of constant variance.
Model Complexity: The inclusion of multiple variables increases the complexity of the model, making it harder to interpret and raising the risk of overfitting to this specific dataset.
Data Limitations:
The data used is from a specific geographic location (Saratoga County, NY), limiting the generalizability of the model to other regions or time periods.
Some important variables (e.g., economic factors like interest rates, inflation, or proximity to amenities) may not have been included, which could impact property prices.
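The multicollinearity and independence figures quoted in the limitations above can be put in context with a short sketch. It assumes the standard Durbin-Watson formula and the approximation DW ≈ 2(1 − ρ) for the lag-1 residual autocorrelation ρ; the function names and the example residuals are hypothetical and are not taken from the fitted model.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    divided by the sum of squared residuals. Values near 2 suggest no
    lag-1 autocorrelation; values below 2 suggest positive autocorrelation."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den

def vif_lower_bound(pairwise_r):
    """Smallest variance inflation factor consistent with one pairwise
    correlation. With more predictors the true VIF can only be larger."""
    return 1.0 / (1.0 - pairwise_r ** 2)

# Reported D-W statistic, using the approximation DW ~ 2 * (1 - rho).
implied_rho = 1 - 1.5472 / 2          # about 0.23

# Reported correlation between living area and bedrooms.
vif_floor = vif_lower_bound(0.73)     # about 2.14

# Hypothetical residuals with runs of same-signed values (illustration only).
example_residuals = [0.5, 0.4, 0.1, -0.2, -0.3, 0.1, 0.3]
dw_example = durbin_watson(example_residuals)

print(round(implied_rho, 4), round(vif_floor, 2), round(dw_example, 3))
```

Under these assumptions, the reported statistic corresponds to a lag-1 autocorrelation of roughly 0.23, and the 0.73 correlation implies a variance inflation factor of at least about 2.1 for either variable.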
Property Valuation:
Mortgage Lending & Risk Assessment:
Decision-Making for Stakeholders:
Market Trend Insights:
Provides insights into housing market trends and informs policy development.
The effect of a waterfront view or new construction on a property's investment value could be investigated further.
The modelling aimed to predict house prices using data from Saratoga County, New York.
The final model chosen, a log-log regression, performed best in terms of RMSE and MAE.
Key findings showed that Waterfront, Living Area, and Bathrooms were significant predictors.
The derived formula can predict property values, assisting in a variety of applications:
Property valuation, mortgage lending, risk assessment
Decision-making support for real estate professionals, buyers and sellers